Analysis of White Wine by Chester Fung

========================================================

Objective of the Project

The objective of this project is to understand which chemical properties contribute to the quality of white wine. By breaking the data down into its individual features and analyzing their relationships, we want to understand how strongly each feature relates to the quality of the white wine.

Structure of the data set

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.0             0.27        0.36           20.7     0.045
## 2 2           6.3             0.30        0.34            1.6     0.049
## 3 3           8.1             0.28        0.40            6.9     0.050
## 4 4           7.2             0.23        0.32            8.5     0.058
## 5 5           7.2             0.23        0.32            8.5     0.058
## 6 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 6       6
## [1] 4898   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

This data set contains 4,898 white wines and 13 variables: a row index (X), 11 chemical properties of each wine, and a quality score. The quality of each wine is scored between 0 (very bad) and 10 (excellent).

Other observations: The median quality of white wine is 6.00.
The mean quality is 5.88.
The maximum quality is 9.00.
At least 75% of white wines have a quality of 6.00 or lower (the third quartile is 6.00).
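
The structure listing above could be reproduced roughly as follows. This is only a sketch: the six rows are copied from the head() listing so the example runs on its own, and the file name in the comment is an assumption, not something stated in this report.

```r
# First six rows copied from the head() listing above. The full analysis would
# read the complete file instead, e.g. read.csv("wineQualityWhites.csv")
# (file name assumed, not given in this report).
csv_text <- "X,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
1,7.0,0.27,0.36,20.7,0.045,45,170,1.0010,3.00,0.45,8.8,6
2,6.3,0.30,0.34,1.6,0.049,14,132,0.9940,3.30,0.49,9.5,6
3,8.1,0.28,0.40,6.9,0.050,30,97,0.9951,3.26,0.44,10.1,6
4,7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.40,9.9,6
5,7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.40,9.9,6
6,8.1,0.28,0.40,6.9,0.050,30,97,0.9951,3.26,0.44,10.1,6"
wwine <- read.csv(text = csv_text)

str(wwine)    # column types
dim(wwine)    # 6 rows in this excerpt; 4898 in the full data set
names(wwine)

# X is only a row index, so it is dropped before summarising the features
features <- wwine[, setdiff(names(wwine), "X")]
summary(features)
```

Dropping X before calling summary() keeps the row index out of the summary table, where its quartiles carry no meaning.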

Summary of the data set

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

The summary shows the distribution of each feature. Interesting observations: the quality of most white wines falls between 5 and 6 (the first and third quartiles), and the mean alcohol level is 10.51.

Histogram plots of all the features

Let’s take a quick look at the distribution plots of all the features by using grid.arrange

What is/are the main feature(s) of interest in your dataset?

The main feature in the data set is quality

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The other chemical properties (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol) might contribute to the quality of white wine.

Let’s take a look at the descriptions of these properties:

fixed acidity - has a direct influence on the color, balance, and taste of the wine

volatile acidity - also known as a wine fault; an unpleasant characteristic of a wine resulting from poor winemaking practices or storage conditions, and leading to wine spoilage

citric acid - a weak organic tribasic acid

residual sugar - influences how sweet a wine will taste; measured in grams of sugar per litre of wine

chlorides -

free sulfur dioxide - serves as an antibiotic and antioxidant, protecting wine from spoilage by bacteria and oxidation. It helps minimize volatile acidity

total sulfur dioxide - refers to both free and bound SO2

density - proportional to the sugar content; expected to fall as the sugar is converted into alcohol by fermentation

pH - strength of acidity

sulphates - added as a preservative to prevent spoilage and oxidation at several stages of winemaking. Without sulfites, grape juice would quickly turn to vinegar

alcohol - amount of alcohol

Did you create any new variables from existing variables in the dataset?

Yes: a ‘rating’ variable, derived from wine quality.

And let’s look at the “quality” variable specifically

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000
## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5
##  int [1:4898] 6 6 6 6 6 6 6 6 6 6 ...

Bar chart of quality

The above bar chart shows the distribution of white wine quality. A quality of 6 is the most common score, and most white wines fall between a quality of 5 and 7.
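
The frequency table printed above is enough to rebuild the quality vector and check these statements numerically. A minimal sketch in base R (the variable names are illustrative only):

```r
# Counts per quality score, copied from the table() output above
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Expand the counts back into one quality value per wine
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

length(quality)    # number of wines
summary(quality)   # min, quartiles, mean, max

# Share of wines at each score, and the combined share for scores 5 to 7
share <- prop.table(table(quality))
round(share, 3)
share_5_to_7 <- sum(share[c("5", "6", "7")])
share_5_to_7
```

The combined share for scores 5 to 7 comes out at roughly 93%, which backs up the reading of the bar chart.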

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I transformed the ‘quality’ variable into ‘rating’ by separating the wines into groups based on their quality values. This gives a clearer scatterplot of the relationship between alcohol level and quality. See the plots in a later section.
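
One way such a grouping could be built is with cut(). The report does not state the cut points that were used, so the breaks below (3-5, 6-7, 8-9) are purely an assumption for illustration; the labels follow the ‘bad’, ‘good’, and ‘great’ ratings mentioned in the plot descriptions.

```r
# Quality values rebuilt from the frequency table shown earlier
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# ASSUMED cut points (not given in the report):
#   quality 3-5 -> bad, 6-7 -> good, 8-9 -> great
rating <- cut(quality,
              breaks = c(0, 5, 7, 10),
              labels = c("bad", "good", "great"))

table(rating)
```

With these assumed breaks the middle group is by far the largest, so different cut points would change how the colours are balanced in the scatterplots.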

Let’s take a look at how the acid variables affect the quality

Based on the charts, the middle 50% of fixed acidity values fall between 6.3 and 7.3, citric acid between 0.27 and 0.39, volatile acidity between 0.21 and 0.32, and pH between 3.09 and 3.28.

Fixed acidity, citric acid, and pH appear to be approximately normally distributed; volatile acidity is the exception.

Let’s do a log transformation for volatile acidity:

After the log transformation, volatile acidity looks approximately normal.
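
The effect of a log transformation on a right-skewed variable can be sketched with simulated data. The values below are NOT the wine data: they are drawn from a log-normal distribution with assumed parameters, and skewness() is a small helper defined here, not a base R function.

```r
set.seed(42)

# Simulated right-skewed stand-in for volatile acidity (assumed parameters)
x <- rlnorm(4898, meanlog = log(0.26), sdlog = 0.35)

# Sample skewness: third central moment divided by variance^(3/2)
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

skew_raw <- skewness(x)          # clearly positive: long right tail
skew_log <- skewness(log10(x))   # close to zero after the transformation

c(raw = skew_raw, log10 = skew_log)
```

A skewness near zero after the transformation is the numerical counterpart of a histogram that looks symmetric.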

Histograms of Density, Chlorides, Residual Sugar, Alcohol

Chlorides - falls between and . Right-skewed

Residual sugar - falls between and . Right-skewed

Density - falls between

Alcohol - falls between

Let’s do log transformation for chlorides and residual sugar

Chlorides after log transformation now looks more normal

Let’s examine the acid variables, starting with fixed acid vs citric acid:

## 
##  Pearson's product-moment correlation
## 
## data:  wwine$fixed.acidity and wwine$citric.acid
## t = 21.137, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2633067 0.3146389
## sample estimates:
##       cor 
## 0.2891807

Fixed acidity and citric acid have a positive correlation of 0.289.
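
The t statistic and confidence interval in the cor.test output follow directly from the correlation coefficient and the sample size. The sketch below recomputes them from the reported values, using the standard formulas for Pearson's correlation (t = r * sqrt(df) / sqrt(1 - r^2) and a Fisher z interval):

```r
r  <- 0.2891807   # reported correlation, fixed acidity vs citric acid
n  <- 4898        # number of wines
df <- n - 2

# t statistic for testing whether the true correlation is zero
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

t_stat
ci
```

With almost 4,900 observations the interval is narrow, so even a modest correlation such as 0.289 is estimated quite precisely.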

Volatile acid vs citric acid:

## 
##  Pearson's product-moment correlation
## 
## data:  wwine$volatile.acidity and wwine$citric.acid
## t = -9.3688, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1601217 -0.1050945
## sample estimates:
##        cor 
## -0.1327103

Volatile acidity and citric acid have a negative correlation of -0.133.

free.sulfur.dioxide vs total.sulfur.dioxide

## 
##  Pearson's product-moment correlation
## 
## data:  wwine$free.sulfur.dioxide and wwine$total.sulfur.dioxide
## t = 54.645, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5977994 0.6326026
## sample estimates:
##      cor 
## 0.615501

Free sulfur dioxide and total sulfur dioxide have a positive correlation of 0.616.

fixed acidity vs quality

## 
##  Pearson's product-moment correlation
## 
## data:  wwine$fixed.acidity and wwine$quality
## t = -8.005, df = 4896, p-value = 1.48e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.14121974 -0.08592991
## sample estimates:
##        cor 
## -0.1136628

Fixed acidity and quality have a negative correlation of -0.114.

Examine the correlation coefficients of the different features against quality

## 
##  Pearson's product-moment correlation
## 
## data:  wwine$fixed.acidity and as.numeric(wwine$quality)
## t = -8.005, df = 4896, p-value = 1.48e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.14121974 -0.08592991
## sample estimates:
##        cor 
## -0.1136628
## 
##  Pearson's product-moment correlation
## 
## data:  wwine$volatile.acidity and as.numeric(wwine$quality)
## t = -14.087, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2241308 -0.1702981
## sample estimates:
##        cor 
## -0.1973632
## 
##  Pearson's product-moment correlation
## 
## data:  wwine$citric.acid and as.numeric(wwine$quality)
## t = -0.6444, df = 4896, p-value = 0.5193
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03720595  0.01880221
## sample estimates:
##          cor 
## -0.009209091
## 
##  Pearson's product-moment correlation
## 
## data:  wwine$residual.sugar and as.numeric(wwine$quality)
## t = -6.8603, df = 4896, p-value = 7.724e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.12524103 -0.06976101
## sample estimates:
##         cor 
## -0.09757683
## 
##  Pearson's product-moment correlation
## 
## data:  wwine$chlorides and as.numeric(wwine$quality)
## t = -15.024, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2365501 -0.1830039
## sample estimates:
##        cor 
## -0.2099344
## 
##  Pearson's product-moment correlation
## 
## data:  wwine$free.sulfur.dioxide and as.numeric(wwine$quality)
## t = 0.57085, df = 4896, p-value = 0.5681
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.01985292  0.03615626
## sample estimates:
##         cor 
## 0.008158067
## 
##  Pearson's product-moment correlation
## 
## data:  wwine$total.sulfur.dioxide and as.numeric(wwine$quality)
## t = -12.418, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2017563 -0.1474524
## sample estimates:
##        cor 
## -0.1747372
## 
##  Pearson's product-moment correlation
## 
## data:  wwine$density and as.numeric(wwine$quality)
## t = -22.581, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3322718 -0.2815385
## sample estimates:
##        cor 
## -0.3071233
## 
##  Pearson's product-moment correlation
## 
## data:  wwine$pH and as.numeric(wwine$quality)
## t = 6.9917, df = 4896, p-value = 3.081e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07162022 0.12707983
## sample estimates:
##        cor 
## 0.09942725
## 
##  Pearson's product-moment correlation
## 
## data:  wwine$sulphates and as.numeric(wwine$quality)
## t = 3.7613, df = 4896, p-value = 0.000171
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.02571007 0.08156172
## sample estimates:
##        cor 
## 0.05367788
## 
##  Pearson's product-moment correlation
## 
## data:  wwine$alcohol and as.numeric(wwine$quality)
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4126015 0.4579941
## sample estimates:
##       cor 
## 0.4355747

Analysis:
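Positive relationship with quality - free sulfur dioxide, pH, sulphates, alcohol
Negative relationship with quality - fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, total sulfur dioxide, density
The coefficients for citric acid (p = 0.52) and free sulfur dioxide (p = 0.57) are not statistically distinguishable from zero.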
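
To compare the eleven results side by side, the coefficients and confidence intervals reported above can be collected into one table and ranked by absolute size. This is a sketch that only re-uses the printed numbers; the column names are illustrative.

```r
# Correlations with quality and 95% CIs, copied from the cor.test output above
cors <- data.frame(
  feature = c("fixed.acidity", "volatile.acidity", "citric.acid",
              "residual.sugar", "chlorides", "free.sulfur.dioxide",
              "total.sulfur.dioxide", "density", "pH", "sulphates", "alcohol"),
  cor     = c(-0.1136628, -0.1973632, -0.009209091, -0.09757683, -0.2099344,
              0.008158067, -0.1747372, -0.3071233, 0.09942725, 0.05367788,
              0.4355747),
  ci_low  = c(-0.14121974, -0.2241308, -0.03720595, -0.12524103, -0.2365501,
              -0.01985292, -0.2017563, -0.3322718, 0.07162022, 0.02571007,
              0.4126015),
  ci_high = c(-0.08592991, -0.1702981, 0.01880221, -0.06976101, -0.1830039,
              0.03615626, -0.1474524, -0.2815385, 0.12707983, 0.08156172,
              0.4579941),
  stringsAsFactors = FALSE
)

# A confidence interval that spans zero means the sign is not reliable
cors$spans_zero <- cors$ci_low < 0 & cors$ci_high > 0

# Rank by strength of association, ignoring the sign
ranked <- cors[order(-abs(cors$cor)), ]
ranked
```

Ranking by absolute value puts alcohol first and density second, and the spans_zero column flags the two features whose direction cannot be trusted.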

What was the strongest relationship you found?

The strongest relationship was between quality and alcohol: cor.test returns 0.436.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

All of the features were measured against quality, with alcohol having the largest coefficient (0.436). Fixed acidity (-0.114), volatile acidity (-0.197), citric acid (-0.009), residual sugar (-0.098), chlorides (-0.210), free sulfur dioxide (0.008), total sulfur dioxide (-0.175), density (-0.307), pH (0.099), and sulphates (0.054) all have relatively weak correlations with quality.

Based on the description of the features, a negative coefficient for volatile acidity (a wine fault) makes sense: the more volatile acidity, the worse the wine quality. This was expected; however, I was surprised that the coefficient is only -0.197, as I thought the relationship would be stronger.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The correlation coefficient between ‘free.sulfur.dioxide’ and ‘total.sulfur.dioxide’ is 0.616. This is somewhat to be expected, since one is a subset of the other. Upon further review, the formula is the following:

total sulfur dioxide = free sulfur dioxide + bound sulfur dioxide
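
Rearranging this formula gives the bound portion directly. A small sketch using the six rows from the head() listing (the variable names are illustrative):

```r
# free.sulfur.dioxide and total.sulfur.dioxide for the first six wines
free_so2  <- c(45, 14, 30, 47, 47, 30)
total_so2 <- c(170, 132, 97, 186, 186, 97)

# bound = total - free
bound_so2 <- total_so2 - free_so2
bound_so2

# Fraction of the total that is free; it varies from wine to wine
free_share <- free_so2 / total_so2
round(free_share, 2)
```

Because the free share differs between wines, total sulfur dioxide is not simply a multiple of free sulfur dioxide, which is consistent with a correlation well below 1.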

Sulfur dioxide is used as a preservative because of its anti-oxidative and anti-microbial properties, and also as a cleaning agent for barrels and winery facilities.

Other relationships were explored as well, including volatile acidity and citric acid (cor.test = -0.133) and fixed acidity and citric acid (cor.test = 0.289).

In the following plots we’ll examine the relationship between quality and other features. Instead of quality, we’ll use our new variable ‘rating’.

Impact of alcohol and pH on White Wine Rating:

We can see a trend between alcohol content and rating: the higher the alcohol content, the higher the rating.

Many of the bad ratings (red) appear on the left side, at low alcohol content. Toward the middle of the chart (alcohol content between 10% and 12%), there are many green ratings. The great ones (blue) appear on the right side, where alcohol is over 12%.

Let’s switch variable from pH to density

Impact of alcohol and density on White Wine Rating:

This graph shows a similar trend to the previous one: the higher the alcohol content, the higher the rating. Density ranges between 0.99 and 1.0.

Let’s switch variable from density to volatile acidity:

Impact of volatile acidity and alcohol on White Wine Rating

The above graph continues the trend that the higher the alcohol %, the higher the rating. It also shows that the higher the volatile acidity, the more ‘bad’ ratings there are, which is consistent with our understanding of volatile acidity.

Impact of citric acidity and alcohol on White Wine Rating

This graph continues the trend that the higher the alcohol %, the higher the rating. It also shows that most citric acid values fall between 0 and 0.5.

Impact of sulphates and alcohol on White Wine Rating

This graph continues with the trend that the higher the alcohol %, the higher the rating.

Impact of chlorides and pH on White Wine Rating

Let’s take a deeper look at the effects of different variables and alcohol on rating:

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Alcohol and pH were evaluated against ‘rating’, which is a coarser grouping of ‘quality’. The scatterplot shows a clear distinction: the ‘bad’ wines are concentrated at lower alcohol levels, and the ‘good’ wines are more concentrated at higher alcohol levels.

Were there any interesting or surprising interactions between features?

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

A linear regression was created using the ‘lm’ function, with alcohol and quality as the only two variables. However, the R-squared value turned out to be 0.19, which is relatively low. Therefore, additional features need to be included to see whether they improve the R-squared value.

## List of 11
##  $ call         : language lm(formula = I(alcohol) ~ I(quality), data = wwine)
##  $ terms        :Classes 'terms', 'formula' length 3 I(alcohol) ~ I(quality)
##   .. ..- attr(*, "variables")= language list(I(alcohol), I(quality))
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "I(alcohol)" "I(quality)"
##   .. .. .. ..$ : chr "I(quality)"
##   .. ..- attr(*, "term.labels")= chr "I(quality)"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(I(alcohol), I(quality))
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "I(alcohol)" "I(quality)"
##  $ residuals    :Class 'AsIs'  Named num [1:4898] -1.788 -1.088 -0.488 -0.688 -0.688 ...
##   .. ..- attr(*, "names")= chr [1:4898] "1" "2" "3" "4" ...
##  $ coefficients : num [1:2, 1:4] 6.9567 0.6052 0.1063 0.0179 65.4702 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "(Intercept)" "I(quality)"
##   .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
##  $ aliased      : Named logi [1:2] FALSE FALSE
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "I(quality)"
##  $ sigma        : num 1.11
##  $ df           : int [1:3] 2 4896 2
##  $ r.squared    : num 0.19
##  $ adj.r.squared: num 0.19
##  $ fstatistic   : Named num [1:3] 1146 1 4896
##   ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
##  $ cov.unscaled : num [1:2, 1:2] 0.0092 -0.00153 -0.00153 0.00026
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "(Intercept)" "I(quality)"
##   .. ..$ : chr [1:2] "(Intercept)" "I(quality)"
##  - attr(*, "class")= chr "summary.lm"
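
For a regression with a single predictor, R-squared is simply the squared Pearson correlation, so the 0.19 above can be cross-checked against the cor.test result for alcohol and quality. The sketch below does that, then uses simulated data (NOT the wine data; the coefficients are arbitrary assumptions) to show that the value is the same whichever variable is treated as the response.

```r
# Reported correlation between alcohol and quality
r_reported <- 0.4355747
r_reported^2   # about 0.19, matching $ r.squared in the output above

# Simulated illustration (assumed, arbitrary relationship)
set.seed(1)
quality_sim <- sample(3:9, 500, replace = TRUE)
alcohol_sim <- 8 + 0.5 * quality_sim + rnorm(500)

fit_alcohol <- lm(alcohol_sim ~ quality_sim)   # same direction as the model above
fit_quality <- lm(quality_sim ~ alcohol_sim)   # response and predictor swapped

r2_alcohol <- summary(fit_alcohol)$r.squared
r2_quality <- summary(fit_quality)$r.squared
r_sim      <- cor(alcohol_sim, quality_sim)

c(r2_alcohol = r2_alcohol, r2_quality = r2_quality, r_squared = r_sim^2)
```

Since a single predictor cannot push R-squared beyond the squared correlation, a better fit would need a model of quality with several features, for example quality ~ alcohol + density + volatile.acidity (a suggestion only, not something fitted in this report).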

Final Plots and Summary

Plot One

Description One

At the start of the analysis, we determined that quality is the most important element, as it affects the prices of white wines. We also know that white wines are assigned different quality scores. But how are the quality scores distributed? We start with a univariate graph showing the distribution of white wine quality. The next two graphs then show how other features relate to the quality score.

Plot Two

Description Two

We then take a deeper look at a bivariate graph showing the relationship between alcohol level and quality (rating). As the graph shows, there’s a trend of higher alcohol content going with higher quality. Next, let’s look at a multivariate graph showing how multiple features relate to quality (rating).

Plot Three

Description Three

The last graph is a multivariate graph showing the relationship between density, alcohol, and quality (rating) of wine. The trend continues: the higher the alcohol level, the higher the quality (rating) of the wine. As for density, most wines fall between 0.99 and 1.0.

Reflection

This was a good exercise in learning how to explore a dataset and find relationships between its features. Before the analysis, I thought features such as volatile.acidity or density would have the most impact on the quality of the wine. But when the cor.test analysis was performed, alcohol showed the highest correlation with quality, at 0.436. The cor.test also showed other interesting relationships, including positive relationships with quality (free sulfur dioxide, pH, sulphates) and negative relationships with quality (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, total sulfur dioxide, density). The cor.test was a very good starting point for the analysis, as it gave a direction for further analysis.

We then tried to confirm this by exploring the relationship with bivariate and multivariate graphs. The graphs produced further supported the positive relationship between alcohol and quality (rating). Different variables were also included in the multivariate plots, which showed the same trend of higher alcohol level going with higher quality (rating).

Future Analysis

There are two tasks that I would like to continue working on with this dataset in future analysis.

  1. Explore other positive/negative relationships with quality.

  2. Create models. Currently I have only performed a quick analysis using the lm function. I would like to explore this further by using different algorithms and checking which one predicts most accurately.

References

Dataset: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt

Role of sulfur dioxide: https://winobrothers.com/2011/10/11/sulfur-dioxide-so2-in-wine/

Sulfur dioxide in winemaking: https://en.wikipedia.org/wiki/Sulfur_dioxide#In_winemaking

Acids in wine: https://en.wikipedia.org/wiki/Acids_in_wine